ABSTRACT

The paper investigates the use of richer syntactic dependen- 
cies in the structured language model (SLM). We present 
two simple methods of enriching the dependencies in the 
syntactic parse trees used for initializing the SLM. We evaluate
the impact of both methods on the perplexity (PPL)
and word-error-rate (WER, N-best rescoring) performance 
of the SLM. We show that the new model achieves an im- 
provement in PPL and WER over the baseline results re- 
ported using the SLM on the UPenn Treebank and Wall 
Street Journal (WSJ) corpora, respectively. 

1. INTRODUCTION 

The structured language model uses hidden parse trees to 
assign conditional word-level language model probabilities. 
As explained in [[[[], Section 4.4.1, the potential reduction
in PPL, relative to a 3-gram baseline, using the SLM's
headword parametrization for word prediction is about 40%.
The key to achieving this is a good guess of the final best 
parse for a given sentence as it is being traversed left-to- 
right. This is much harder than finding the final best parse 
for the entire sentence, as it is sought in a regular statistical 
parser. Nevertheless, it is expected that techniques devel- 
oped in the statistical parsing community that aim at recov- 
ering the best parse for an entire sentence, i.e. as judged by a 
human annotator, should be productive in reducing the PPL 
of the SLM as well. 

In this paper we present a simple and novel way of en- 
riching the probabilistic dependencies in the CONSTRUC- 
TOR component of the SLM showing that it leads to better 
PPL and WER performance of the model. Similar ways of 
enriching the dependency structure underlying the parametriza- 
tion of the probabilistic model used for scoring a given parse 
tree are used in the statistical parsing community [0], [^J. 
Recently, such models [Q], [||] have been shown to outper- 
form the SLM in terms of PPL and WER on the UPenn 
Treebank and Wall Street Journal corpora, respectively. The 
simple modification we present brings the WER performance 
of the SLM to the same level as the best reported in [Q],
despite a modest improvement in PPL when interpolating 
the SLM with a 3-gram model. 

The remainder of the paper is organized as follows:
Section 2 briefly describes the SLM. Section 3 discusses
the binarization and headword percolation procedure used
in the standard training of the SLM, followed by a description
of the procedure used for enriching the syntactic dependencies
in the SLM. Section 4 describes the experimental
setup and results. Section 5 discusses the results and indicates
future research directions.



2. STRUCTURED LANGUAGE MODEL 
OVERVIEW 

An extensive presentation of the SLM can be found in [0].
The model assigns a probability $P(W, T)$ to every sentence
$W$ and every possible binary parse $T$ of it. The terminals
of $T$ are the words of $W$ with POStags, and the nodes of
$T$ are annotated with phrase headwords and non-terminal
labels. Let $W$ be a sentence of length $n$ words to which
we have prepended the sentence beginning marker <s> and
appended the sentence end marker </s> so that $w_0 =$ <s>
and $w_{n+1} =$ </s>. Let $W_k = w_0 \ldots w_k$ be the word
k-prefix of the sentence (the words from the beginning of
the sentence up to the current position $k$) and $W_k T_k$ the
word-parse k-prefix. Figure 1 shows a word-parse k-prefix;
$h_0, \ldots, h_{-m}$ are the exposed heads, each head being
a pair (headword, non-terminal label), or (word, POStag) in
the case of a root-only tree. The exposed heads at a given
position $k$ in the input sentence are a function of the
word-parse k-prefix.

Fig. 1. A word-parse k-prefix



Fig. 2. Result of adjoin-left under NTlabel: $h'_0 = (h_{-1}.\mathrm{word}, \mathrm{NTlabel})$

Fig. 3. Result of adjoin-right under NTlabel: $h'_{-1} = h_{-2}$, $h'_0 = (h_0.\mathrm{word}, \mathrm{NTlabel})$

2.1. Probabilistic Model 

The joint probability P(W, T) of a word sequence W and a 
complete parse T can be broken into: 

$$P(W, T) = \prod_{k=1}^{n+1} \Big[ P(w_k / W_{k-1}T_{k-1}) \cdot P(t_k / W_{k-1}T_{k-1}, w_k) \cdot \prod_{i=1}^{N_k} P(p_i^k / W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) \Big] \quad (1)$$

where: 

• $W_{k-1}T_{k-1}$ is the word-parse (k-1)-prefix

• $w_k$ is the word predicted by WORD-PREDICTOR

• $t_k$ is the tag assigned to $w_k$ by the TAGGER

• $N_k - 1$ is the number of operations the CONSTRUCTOR
executes at sentence position $k$ before passing control to the
WORD-PREDICTOR (the $N_k$-th operation at position $k$ is
the null transition); $N_k$ is a function of $T$

• $p_i^k$ denotes the i-th CONSTRUCTOR operation carried
out at position $k$ in the word string; the operations performed
by the CONSTRUCTOR are illustrated in Figures 2-3
and they ensure that all possible binary branching parses
with all possible headword and non-terminal label assignments
for the $w_1 \ldots w_k$ word sequence can be generated.
The $p_1^k \ldots p_{N_k}^k$ sequence of CONSTRUCTOR operations at
position $k$ grows the word-parse (k-1)-prefix into a
word-parse k-prefix.
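
The effect of the CONSTRUCTOR operations on the exposed heads can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the exposed heads are kept in a Python list ordered from oldest to most recent (so `heads[-1]` plays the role of $h_0$ and `heads[-2]` of $h_{-1}$), the function names `adjoin_left`, `adjoin_right` and `shift` are hypothetical, and the tag `SB` used for `<s>` in the usage note is made up for the example.

```python
from typing import List, Tuple

# An exposed head is a pair (headword, NT label), or (word, POStag) for a root-only tree.
Head = Tuple[str, str]


def adjoin_left(heads: List[Head], nt_label: str) -> List[Head]:
    """Merge the two most recent heads; the headword is taken from the left one (h_{-1})."""
    *rest, h_minus1, _h_0 = heads
    return rest + [(h_minus1[0], nt_label)]


def adjoin_right(heads: List[Head], nt_label: str) -> List[Head]:
    """Merge the two most recent heads; the headword is taken from the right one (h_0)."""
    *rest, _h_minus1, h_0 = heads
    return rest + [(h_0[0], nt_label)]


def shift(heads: List[Head], word: str, pos_tag: str) -> List[Head]:
    """After the null transition, the next predicted and tagged word becomes the new h_0."""
    return heads + [(word, pos_tag)]
```

For instance, with `heads = [("<s>", "SB"), ("the", "DT"), ("group", "NN")]`, calling `adjoin_right(heads, "NP")` leaves two exposed heads, the second being `("group", "NP")`, while everything to the left of the merged pair is unchanged.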

Our model is based on three probabilities, each esti- 
mated using deleted interpolation and parameterized (ap- 
proximated) as follows: 



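$$P(w_k / W_{k-1}T_{k-1}) \approx P(w_k / h_0, h_{-1}) \quad (2)$$
$$P(t_k / w_k, W_{k-1}T_{k-1}) \approx P(t_k / w_k, h_0, h_{-1}) \quad (3)$$
$$P(p_i^k / W_k T_k) \approx P(p_i^k / h_0, h_{-1}) \quad (4)$$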



It is worth noting that if the binary branching structure de- 
veloped by the parser were always right-branching and we 
mapped the POStag and non-terminal label vocabularies to 
a single type then our model would be equivalent to a trigram
language model. Since the number of parses for a
given word prefix $W_k$ grows exponentially with $k$, $|\{T_k\}| \sim O(2^k)$,
the state space of our model is huge even for relatively
short sentences, so we had to use a search strategy
that prunes it. Our choice was a synchronous multi-stack
search algorithm which is very similar to a beam search.

The language model probability assignment for the word 
at position k + 1 in the input sentence is made using: 

$$P_{SLM}(w_{k+1} / W_k) = \sum_{T_k \in S_k} P(w_{k+1} / W_k T_k) \cdot \rho(W_k, T_k),$$

$$\rho(W_k, T_k) = P(W_k T_k) \Big/ \sum_{T_k \in S_k} P(W_k T_k) \quad (5)$$

which ensures a proper probability over strings $W^*$, where
$S_k$ is the set of all parses present in our stacks at the current
stage $k$.
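
The probability assignment in Eq. (5) can be illustrated with a short sketch. It assumes, purely for illustration, that the stacks are available as a list of `(parse, log_prob)` pairs, where `log_prob` stands for $\log P(W_k T_k)$, and that a callable `word_predictor(word, parse)` returns $P(w_{k+1} / W_k T_k)$; both names are hypothetical and not taken from the SLM implementation.

```python
import math
from typing import Callable, List, Tuple, TypeVar

Parse = TypeVar("Parse")


def slm_next_word_prob(
    word: str,
    stack: List[Tuple[Parse, float]],
    word_predictor: Callable[[str, Parse], float],
) -> float:
    """Weighted sum over the surviving parses, with weights rho normalized to sum to one."""
    # Subtract the largest log-probability before exponentiating to avoid underflow.
    max_lp = max(lp for _, lp in stack)
    weights = [math.exp(lp - max_lp) for _, lp in stack]
    total = sum(weights)
    return sum(
        (w / total) * word_predictor(word, parse)
        for (parse, _), w in zip(stack, weights)
    )
```

Because the weights are renormalized over the parses that survived pruning, the result is a proper distribution over the next word whenever each per-parse predictor is.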

Each model component — WORD-PREDICTOR, TAG- 
GER, CONSTRUCTOR — is initialized from a set of parsed 
sentences after undergoing headword percolation and binarization,
see Section 3. An N-best EM ^ variant is then
employed to jointly reestimate the model parameters such 
that the PPL on training data is decreased — the likelihood 
of the training data under our model is increased. The re- 
duction in PPL is shown experimentally to carry over to the 
test data. 



3. HEADWORD PERCOLATION AND
BINARIZATION



As explained in the previous section, the SLM is initialized 
on parse trees that have been binarized and the non-terminal 
(NT) tags at each node have been enriched with headwords. 
We will briefly review the headword percolation and bina- 
rization procedures; they are explained in detail in [|IJ. 

The position of the headword within a constituent
(equivalent to a context-free production of the type
$Z \to Y_1 \ldots Y_n$, where $Z, Y_1, \ldots, Y_n$ are NT labels or POStags,
the latter only for $Y_i$) is identified using a rule-based approach.

Assuming that the index of the headword on the right-hand
side of the rule is $k$, we binarize the constituent as follows:
depending on the identity of $Z$ we apply one of the two
binarization schemes in Figure 4. The intermediate nodes
created by the above binarization schemes receive the NT
label $Z'$. The choice between the two schemes is made according
to a list of rules based on the identity of the label on
the left-hand side of a CF rewrite rule.

Fig. 4. Binarization schemes

Under the equivalence classification in Eq. (4), the conditioning
information available to the CONSTRUCTOR model
component is the two most recent exposed heads, consisting
of two NT tags and two headwords. In an attempt to extend
the syntactic dependencies beyond this level, we enrich the
non-terminal tag of a node in the binarized parse tree with
the NT tag of one of its children, or both. We distinguish between
two ways of picking the child from which the NT tag
is being percolated:

1. same: we use the non-terminal tag of the node from
which the headword is being percolated

2. opposite: we use the non-terminal tag of the sibling of
the node from which the headword is being percolated

For example, the noun phrase constituent 

(NP 

(DT the) 
(NNP dutch) 
(VBG publishing) 
(NN group) ) 

becomes 

(NP_GROUP 

(DT the) 
(NP'_GROUP 

(NNP dutch) 

(NP'_GROUP (VBG publishing) 
(NN group) ) ) ) 

after binarization and headword percolation and 

(NP+NP'_GROUP
(DT the)
(NP'+NP'_GROUP

(NNP dutch)

(NP'+NN_GROUP (VBG publishing)
(NN group) ) ) )

or

(NP+DT_GROUP
(DT the)
(NP'+NNP_GROUP

(NNP dutch)

(NP'+VBG_GROUP (VBG publishing)
(NN group) ) ) )

after enriching the non-terminal tags using the same and op- 
posite scheme, respectively. 



A given binarized tree is traversed recursively in depth-first
order and each constituent is enriched in the above manner.
The SLM is then initialized on the resulting parse trees.
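
The recursive enrichment can be sketched as below. This is a minimal illustration rather than the authors' code: the `Node` class, the `head_is_left` flag recording which child the headword was percolated from, and the `enrich` and `show` helpers are all assumptions introduced for the example. Only the same and opposite schemes are covered, since the label format for using both children is not spelled out in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str                      # NT label, or POStag for a leaf
    word: str                       # headword, or the word itself for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    head_is_left: bool = False      # child from which the headword was percolated


def enrich(node: Node, scheme: str) -> Node:
    """Return a new tree whose NT labels carry the tag of the head child or its sibling."""
    if node.left is None or node.right is None:
        return Node(node.label, node.word)  # leaves are left unchanged
    if node.head_is_left:
        head, sibling = node.left, node.right
    else:
        head, sibling = node.right, node.left
    if scheme == "same":
        extra = head.label
    elif scheme == "opposite":
        extra = sibling.label
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # The children's original (un-enriched) labels are used, as in the example above.
    return Node(
        f"{node.label}+{extra}",
        node.word,
        enrich(node.left, scheme),
        enrich(node.right, scheme),
        node.head_is_left,
    )


def show(node: Node) -> str:
    """Render a tree in the bracketed notation used in the text."""
    if node.left is None or node.right is None:
        return f"({node.label} {node.word})"
    return f"({node.label}_{node.word.upper()} {show(node.left)} {show(node.right)})"
```

Building the binarized "the dutch publishing group" tree from the example and calling `show(enrich(tree, "same"))` or `show(enrich(tree, "opposite"))` reproduces the two enriched bracketings shown earlier, up to whitespace.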

Although it is hard to find a direct correspondence be- 
tween the above way of enriching the dependency structure 
of the probability model and the ones used in [|[], [Q] or [g], 
they are similar. 

4. EXPERIMENTS 

We have evaluated the PPL performance of the model on the 
UPenn Treebank and the WER performance in the setups 
described in respectively. 

4.1. Perplexity experiments on the UPenn Treebank 

For convenience, we chose to evaluate the performance of 
the enriched SLM on the UPenn Treebank corpus [Q] — a 
subset of the Wall Street Journal (WSJ) corpus [§]. 

We have evaluated the perplexity of the two different 
ways of enriching the non-terminal tags in the parse tree 
and of using both of them at the same time. For each way of 
initializing the SLM we have performed 3 iterations of N- 
best EM. The word and POS-tagger vocabulary sizes were 
10,000 and 40, respectively. The NT tag/CONSTRUCTOR 
operation vocabulary sizes were 52/157, 954/2863, 712/2137, 
3816/11449 for the baseline, opposite, same and both ini- 
tialization schemes, respectively. The SLM is interpolated 
with a 3-gram model — built on exactly the same training 
data/word vocabulary as the SLM — using a fixed interpo- 
lation weight: 

$$P(\cdot) = \lambda \cdot P_{3gram}(\cdot) + (1 - \lambda) \cdot P_{SLM}(\cdot)$$
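
As a small illustration of how the interpolated model is scored, the sketch below combines per-word probabilities from the two models with a fixed weight and computes perplexity as the exponential of the negative mean log-probability. The function names and the input format (a sequence of per-word probabilities) are assumptions made for the example.

```python
import math
from typing import Sequence


def interpolate(p_3gram: float, p_slm: float, lam: float) -> float:
    """Linear interpolation with weight lam on the 3-gram model."""
    return lam * p_3gram + (1.0 - lam) * p_slm


def perplexity(word_probs: Sequence[float]) -> float:
    """Perplexity of a word sequence given the probability assigned to each word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))
```

With a weight of 1.0 the interpolated model reduces to the 3-gram model and with 0.0 to the SLM alone, which is why the weight-1.0 results are identical across all initialization schemes.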

The results are summarized in Table 1. The baseline model
is the standard SLM.

Model      Iter   λ = 0.0   λ = 0.6   λ = 1.0
baseline   0      167.38    151.89    166.63
baseline   3      158.75    148.67    166.63
opposite   0      157.61    146.99    166.63
opposite   3      150.83    144.08    166.63
same       0      163.31    149.56    166.63
same       3      155.29    146.39    166.63
both       0      160.48    147.52    166.63
both       3      153.30    144.99    166.63

Table 1. Deleted Interpolation 3-gram + SLM; PPL Results

As can be seen, the model initialized using the opposite scheme performed
best, reducing the PPL of the SLM by 5% relative to the
SLM baseline performance. However, the improvement in
PPL is minor after interpolating with the 3-gram model.
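
For reference, the relative reductions can be checked directly from the PPL values after 3 iterations, comparing the opposite scheme with the baseline at interpolation weights 0.0 and 0.6:

```latex
\frac{158.75 - 150.83}{158.75} \approx 0.050 \quad (5.0\%\ \text{relative, SLM alone})
\qquad
\frac{148.67 - 144.08}{148.67} \approx 0.031 \quad (3.1\%\ \text{relative, interpolated with the 3-gram})
```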



Model                            Iter   Interpolation weight
                                        0.0     0.2     0.4     0.6     0.8     1.0
baseline SLM WER, %              0      13.1    13.1    13.1    13.0    13.4    13.7
opposite SLM WER, %              0      12.7    12.8    12.7    12.7    13.1    13.7
MPSS significance test p-value          0.020   0.017   0.014   0.005   0.070   -

Table 2. Back-off 3-gram + SLM; N-best rescoring WER Results and Statistical Significance



4.2. N-best rescoring results 

We chose to evaluate in the WSJ DARPA'93 HUB1 test
setup. The size of the test set is 213 utterances, 3446 words.
The 20k-word open vocabulary and baseline 3-gram model,
used for generating the lattices and the N-best lists,
are the standard ones provided by NIST and LDC; see |Jl|]
for details. The SLM was trained on 20M words of WSJ text
automatically parsed using the parser in [Q], binarized and
enriched with headwords and the opposite NT tag information
as explained in Section 3. The results are presented in
Table 2.

Since the rescoring experiments are expensive, we have
only evaluated the WER performance of the model initialized
using the opposite scheme. The enriched SLM achieves a
0.3-0.4% absolute reduction in WER over the performance
of the baseline SLM and a full 1.0% absolute over the baseline
3-gram model, for a wide range of values of the interpolation
weight. We note that the performance of the SLM
as a second pass language model is the same even without
interpolating it with the 3-gram model (λ = 0.0; see footnote 2).

We have evaluated the statistical significance of the results
relative to the 3-gram baseline using the standard test
suite in the SCLITE package provided by NIST. We believe
that for WER statistics the most relevant significance test
is the Matched Pair Sentence Segment one. The results are
presented in Table 2. As can be seen, the improvement
achieved by the SLM is highly significant at all values of
the interpolation weight λ except for λ = 0.8.

5. CONCLUSIONS AND FUTURE DIRECTIONS 

We have presented a simple but effective method of enrich- 
ing the syntactic dependencies in the structured language 
model (SLM) that achieves 0.3-0.4% absolute reduction in 
WER over the best previous results reported using the SLM 
on WSJ. The implementation could be greatly improved 
by predicting only the relevant part of the enriched non- 
terminal tag and then adding the part inherited from the 
child. A more comprehensive study of the most produc- 
tive ways of increasing the probabilistic dependencies in the 
parse tree would be desirable. 

2 The N-best lists are generated using the baseline 3-gram model so this 
is not indicative of the performance of the SLM as a first pass language 
model. 